Books by J.K. Rowling and Amitav Ghosh are among the 183,000 titles in a dataset used for AI training: Report

Meta and Bloomberg face scrutiny for training their AI models on Books3, a dataset compiled without authorization that includes novels by renowned authors.

Frontlist, Sep 27, 2023

Companies such as Meta and Bloomberg are reported to have used the Books3 dataset without permission to train their generative AI systems.

According to The Atlantic, books by prominent authors such as J.K. Rowling, Amitav Ghosh, Rupi Kaur, and Neil Gaiman are part of Books3, a dataset of pirated books that large companies have used to train their generative AI models.

The news outlet published a searchable database of the Books3 material, allowing readers to look up authors' names to see whether their works were included in the collection.

A search for J.K. Rowling returned several results because the database includes not only her English-language 'Harry Potter' books but also foreign-language translations. For the other authors, only a few of their published novels were listed in the database.

Many authors voiced displeasure on X (formerly Twitter) and posted screenshots showing that their copyrighted titles were included in the list. Others proposed organizing a class action lawsuit against the companies named.

The Books3 dataset was cited in court filings by writers who sued Meta and OpenAI, alleging that their copyrighted works were pirated and scraped for AI training. Plaintiffs have also sued Google on similar grounds.

In the past, OpenAI has defended the use of copyrighted media for AI training, arguing that the fair use doctrine protects such innovation.

Meanwhile, when Google launched the Bard extensions for Google apps, it stated that data entered through Gmail and Docs would not be used for machine learning.